{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# DKN : Deep Knowledge-Aware Network for News Recommendation\n",
                "\n",
                "DKN \\[1\\] is a deep learning model which incorporates information from knowledge graph for better news recommendation. Specifically, DKN uses TransX \\[2\\] method for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embedding with word embedding and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. \n",
                "\n",
                "## Properties of DKN:\n",
                "\n",
                "- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. \n",
                "- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.\n",
                "- DKN uses an attention module to dynamically calculate a user's aggregated historical representaition.\n",
                "\n",
                "\n",
                "## Data format:\n",
                "\n",
                "### DKN takes several files as input as follows:\n",
                "\n",
                "- **training / validation / test files**: each line in these files represents one instance. Impressionid is used to evaluate performance within an impression session, so it is only used when evaluating, you can set it to 0 for training data. The format is : <br> \n",
                "`[label] [userid] [CandidateNews]%[impressionid] `<br> \n",
                "e.g., `1 train_U1 N1%0` <br> \n",
                "\n",
                "- **user history file**: each line in this file represents a users' click history. You need to set `history_size` parameter in the config file, which is the max number of user's click history we use. We will automatically keep the last `history_size` number of user click history, if user's click history is more than `history_size`, and we will automatically pad with 0 if user's click history is less than `history_size`. the format is : <br> \n",
                "`[Userid] [newsid1,newsid2...]`<br>\n",
                "e.g., `train_U1 N1,N2` <br> \n",
                "\n",
                "- **document feature file**: It contains the word and entity features for news articles. News articles are represented by aligned title words and title entities. To take a quick example, a news title may be: <i>\"Trump to deliver State of the Union address next week\"</i>, then the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entitie value may be: `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of entity vector is non-zero due to the word \"Trump\". The title value and entity value is hashed from 1 to `n` (where `n` is the number of distinct words or entities). Each feature length should be fixed at k (`doc_size` parameter), if the number of words in document is more than k, you should truncate the document to k words, and if the number of words in document is less than k, you should pad 0 to the end. \n",
                "the format is like: <br> \n",
                "`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`\n",
                "\n",
                "- **word embedding/entity embedding/ context embedding files**: These are `*.npy` files of pretrained embeddings. After loading, each file is a `[n+1,k]` two-dimensional matrix, n is the number of words(or entities) of their hash dictionary, k is dimension of the embedding, note that we keep embedding 0 for zero padding. \n",
                "\n",
                "In this experiment, we used GloVe \\[4\\] vectors to initialize the word embedding. We trained entity embedding using TransE \\[2\\] on knowledge graph and context embedding is the average of the entity's neighbors in the knowledge graph.<br>\n",
                "\n",
                "## MIND dataset\n",
                "\n",
                "MIND dataset\\[3\\] is a large-scale English news dataset. It was collected from anonymized behavior logs of Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression.\n",
                "\n",
                "In this notebook we are going to use a subset of MIND dataset, **MIND demo**. MIND demo contains 500 users, 9,432 news articles  and 6,134 impression logs. \n",
                "\n",
                "For this quick start notebook, we are providing directly all the necessary word embeddings, entity embeddings and context embedding files."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Global settings and imports"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                }
            },
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "System version: 3.9.16 (main, May 15 2023, 23:46:34) \n",
                        "[GCC 11.2.0]\n",
                        "Tensorflow version: 2.7.4\n"
                    ]
                }
            ],
            "source": [
                "import warnings\n",
                "warnings.filterwarnings(\"ignore\")\n",
                "\n",
                "import os\n",
                "import sys\n",
                "from tempfile import TemporaryDirectory\n",
                "import tensorflow as tf\n",
                "tf.get_logger().setLevel(\"ERROR\") # only show error messages\n",
                "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)\n",
                "\n",
                "from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources, prepare_hparams\n",
                "from recommenders.models.deeprec.models.dkn import DKN\n",
                "from recommenders.models.deeprec.io.dkn_iterator import DKNTextIterator\n",
                "from recommenders.utils.notebook_utils import store_metadata\n",
                "\n",
                "print(f\"System version: {sys.version}\")\n",
                "print(f\"Tensorflow version: {tf.__version__}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Download and load data"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                }
            },
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "100%|███████████████████████████████████████████████████████████████████████████████| 11.3k/11.3k [01:39<00:00, 113KB/s]\n"
                    ]
                }
            ],
            "source": [
                "tmpdir = TemporaryDirectory()\n",
                "data_path = os.path.join(tmpdir.name, \"mind-demo-dkn\")\n",
                "\n",
                "yaml_file = os.path.join(data_path, \"dkn.yaml\")\n",
                "train_file = os.path.join(data_path, \"train_mind_demo.txt\")\n",
                "valid_file = os.path.join(data_path, \"valid_mind_demo.txt\")\n",
                "test_file = os.path.join(data_path, \"test_mind_demo.txt\")\n",
                "news_feature_file = os.path.join(data_path, \"doc_feature.txt\")\n",
                "user_history_file = os.path.join(data_path, \"user_history.txt\")\n",
                "wordEmb_file = os.path.join(data_path, \"word_embeddings_100.npy\")\n",
                "entityEmb_file = os.path.join(data_path, \"TransE_entity2vec_100.npy\")\n",
                "contextEmb_file = os.path.join(data_path, \"TransE_context2vec_100.npy\")\n",
                "if not os.path.exists(yaml_file):\n",
                "    download_deeprec_resources(\"https://recodatasets.z20.web.core.windows.net/deeprec/\", tmpdir.name, \"mind-demo-dkn.zip\")\n",
                "    "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Create hyper-parameters"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                },
                "tags": [
                    "parameters"
                ]
            },
            "outputs": [],
            "source": [
                "EPOCHS = 10\n",
                "HISTORY_SIZE = 50\n",
                "BATCH_SIZE = 500"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                }
            },
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "HParams object with values {'use_entity': True, 'use_context': True, 'cross_activation': 'identity', 'user_dropout': False, 'dropout': [0.0], 'attention_dropout': 0.0, 'load_saved_model': False, 'fast_CIN_d': 0, 'use_Linear_part': False, 'use_FM_part': False, 'use_CIN_part': False, 'use_DNN_part': False, 'init_method': 'uniform', 'init_value': 0.1, 'embed_l2': 1e-06, 'embed_l1': 0.0, 'layer_l2': 1e-06, 'layer_l1': 0.0, 'cross_l2': 0.0, 'cross_l1': 0.0, 'reg_kg': 0.0, 'learning_rate': 0.0005, 'lr_rs': 1, 'lr_kg': 0.5, 'kg_training_interval': 5, 'max_grad_norm': 2, 'is_clip_norm': 0, 'dtype': 32, 'optimizer': 'adam', 'epochs': 10, 'batch_size': 500, 'enable_BN': True, 'show_step': 10000, 'save_model': False, 'save_epoch': 2, 'write_tfevents': False, 'train_num_ngs': 4, 'need_sample': True, 'embedding_dropout': 0.0, 'EARLY_STOP': 100, 'min_seq_length': 1, 'slots': 5, 'cell': 'SUM', 'doc_size': 10, 'history_size': 50, 'word_size': 12600, 'entity_size': 3987, 'data_format': 'dkn', 'metrics': ['auc'], 'pairwise_metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'method': 'classification', 'activation': ['sigmoid'], 'attention_activation': 'relu', 'attention_layer_sizes': 100, 'dim': 100, 'entity_dim': 100, 'transform': True, 'filter_sizes': [1, 2, 3], 'layer_sizes': [300], 'model_type': 'dkn', 'num_filters': 100, 'loss': 'log_loss', 'news_feature_file': '/tmp/tmpgy77utho/mind-demo-dkn/doc_feature.txt', 'user_history_file': '/tmp/tmpgy77utho/mind-demo-dkn/user_history.txt', 'wordEmb_file': '/tmp/tmpgy77utho/mind-demo-dkn/word_embeddings_100.npy', 'entityEmb_file': '/tmp/tmpgy77utho/mind-demo-dkn/TransE_entity2vec_100.npy', 'contextEmb_file': '/tmp/tmpgy77utho/mind-demo-dkn/TransE_context2vec_100.npy'}\n"
                    ]
                }
            ],
            "source": [
                "hparams = prepare_hparams(yaml_file,\n",
                "                          news_feature_file = news_feature_file,\n",
                "                          user_history_file = user_history_file,\n",
                "                          wordEmb_file=wordEmb_file,\n",
                "                          entityEmb_file=entityEmb_file,\n",
                "                          contextEmb_file=contextEmb_file,\n",
                "                          epochs=EPOCHS,\n",
                "                          history_size=HISTORY_SIZE,\n",
                "                          batch_size=BATCH_SIZE)\n",
                "print(hparams)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Train the DKN model"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                }
            },
            "outputs": [],
            "source": [
                "model = DKN(hparams, DKNTextIterator)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                }
            },
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "{'auc': 0.5218, 'group_auc': 0.5071, 'mean_mrr': 0.1494, 'ndcg@5': 0.1539, 'ndcg@10': 0.2125}\n"
                    ]
                }
            ],
            "source": [
                "print(model.run_eval(valid_file))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                },
                "scrolled": true
            },
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "at epoch 1\n",
                        "train info: logloss loss:0.6945172200600306\n",
                        "eval info: auc:0.5929, group_auc:0.5633, mean_mrr:0.1834, ndcg@10:0.2511, ndcg@5:0.1939\n",
                        "at epoch 1 , train time: 39.8 eval time: 8.8\n",
                        "at epoch 2\n",
                        "train info: logloss loss:0.6527644917368889\n",
                        "eval info: auc:0.5877, group_auc:0.5499, mean_mrr:0.1891, ndcg@10:0.2542, ndcg@5:0.2013\n",
                        "at epoch 2 , train time: 36.0 eval time: 9.0\n",
                        "at epoch 3\n",
                        "train info: logloss loss:0.6361906168361505\n",
                        "eval info: auc:0.6013, group_auc:0.5799, mean_mrr:0.1999, ndcg@10:0.2703, ndcg@5:0.2078\n",
                        "at epoch 3 , train time: 36.0 eval time: 9.0\n",
                        "at epoch 4\n",
                        "train info: logloss loss:0.6205979473888874\n",
                        "eval info: auc:0.611, group_auc:0.5862, mean_mrr:0.1851, ndcg@10:0.2624, ndcg@5:0.1853\n",
                        "at epoch 4 , train time: 36.1 eval time: 8.9\n",
                        "at epoch 5\n",
                        "train info: logloss loss:0.6062351117531458\n",
                        "eval info: auc:0.6148, group_auc:0.5931, mean_mrr:0.1947, ndcg@10:0.2715, ndcg@5:0.1951\n",
                        "at epoch 5 , train time: 36.2 eval time: 9.0\n",
                        "at epoch 6\n",
                        "train info: logloss loss:0.5931083386143049\n",
                        "eval info: auc:0.6153, group_auc:0.5942, mean_mrr:0.2015, ndcg@10:0.2737, ndcg@5:0.2084\n",
                        "at epoch 6 , train time: 36.3 eval time: 9.3\n",
                        "at epoch 7\n",
                        "train info: logloss loss:0.582433108240366\n",
                        "eval info: auc:0.6268, group_auc:0.5981, mean_mrr:0.2011, ndcg@10:0.2765, ndcg@5:0.2085\n",
                        "at epoch 7 , train time: 36.4 eval time: 10.3\n",
                        "at epoch 8\n",
                        "train info: logloss loss:0.5735978713879982\n",
                        "eval info: auc:0.6263, group_auc:0.6052, mean_mrr:0.2034, ndcg@10:0.279, ndcg@5:0.217\n",
                        "at epoch 8 , train time: 36.8 eval time: 9.2\n",
                        "at epoch 9\n",
                        "train info: logloss loss:0.5567030770083269\n",
                        "eval info: auc:0.62, group_auc:0.5958, mean_mrr:0.1942, ndcg@10:0.2688, ndcg@5:0.2019\n",
                        "at epoch 9 , train time: 39.3 eval time: 11.0\n",
                        "at epoch 10\n",
                        "train info: logloss loss:0.5417348792155584\n",
                        "eval info: auc:0.6198, group_auc:0.6035, mean_mrr:0.1929, ndcg@10:0.2692, ndcg@5:0.201\n",
                        "at epoch 10 , train time: 46.3 eval time: 13.2\n"
                    ]
                },
                {
                    "data": {
                        "text/plain": [
                            "<recommenders.models.deeprec.models.dkn.DKN at 0x7f2341f7bb50>"
                        ]
                    },
                    "execution_count": 9,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "model.fit(train_file, valid_file)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Evaluate the DKN model\n",
                "\n",
                "Now we can check the performance on the test set:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {
                "pycharm": {
                    "is_executing": false
                }
            },
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "{'auc': 0.6227, 'group_auc': 0.5963, 'mean_mrr': 0.2014, 'ndcg@5': 0.2066, 'ndcg@10': 0.28}\n"
                    ]
                }
            ],
            "source": [
                "res = model.run_eval(test_file)\n",
                "print(res)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Record results for tests - ignore this cell\n",
                "store_metadata(\"auc\", res[\"auc\"])\n",
                "store_metadata(\"group_auc\", res[\"group_auc\"])\n",
                "store_metadata(\"ndcg@5\", res[\"ndcg@5\"])\n",
                "store_metadata(\"ndcg@10\", res[\"ndcg@10\"])\n",
                "store_metadata(\"mean_mrr\", res[\"mean_mrr\"])\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## References\n",
                "\n",
                "\\[1\\] Wang, Hongwei, et al. \"DKN: Deep Knowledge-Aware Network for News Recommendation.\" Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>\n",
                "\\[2\\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E <br>\n",
                "\\[3\\] Wu, Fangzhao, et al. \"MIND: A Large-scale Dataset for News Recommendation\" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>\n",
                "\\[4\\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/"
            ]
        }
    ],
    "metadata": {
        "celltoolbar": "Tags",
        "interpreter": {
            "hash": "3a9a0c422ff9f08d62211b9648017c63b0a26d2c935edc37ebb8453675d13bb5"
        },
        "kernelspec": {
            "display_name": "Python 3 (ipykernel)",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.9.16"
        },
        "pycharm": {
            "stem_cell": {
                "cell_type": "raw",
                "metadata": {
                    "collapsed": false
                },
                "source": []
            }
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}
